--- title: "Assignment Question 1 and Question 2" author: "Yang Xu" date: '2020-07-25T21:13:14-05:00' categories: R tags: - R Markdown - plot ---

1 Overview

1.1 Background of the case

Mini-Challenge 2 is to analyze movement and tracking data. GAStech provides many of their employees with company cars for their personal and professional use, but unbeknownst to the employees, the cars are equipped with GPS tracking devices. You are given tracking data for the two weeks leading up to the disappearance, as well as credit card transactions and loyalty card usage data. From this data, can you identify anomalies and suspicious behaviors? Can you identify which people use which credit and loyalty cards?

1.2 Literature Review / Motivation

Inspired by the 2014 VAST Challenge, I have learned many techniques from the previous submission. For example, Central South University has used the choropleth map to reveal the interaction between Employee, location as well as time elements.

Most of the submissions also use map trajectories to detect the users of the credit card by matching traveling patterns and staying locations.

However, it is often the case that the visuals are not interactive hence the readers might be affected by the resolution of the image and size of the fonts.

Therefore, this write-up will help to provide more interactive plots to enable the ease of interactivity in visual analytics.

2 Data Description and Preparation

Data Preprocessing

Question 1

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend correcting these anomalies? Please limit your answer to 8 images and 300 words.

#The top few spots by visit 
df_sum <- df1 %>%
  group_by(location) %>%
  summarise(count = n())

plot1 <- ggplot(df_sum, 
                aes(x = reorder(location, count), 
                    y = count,
                    text = paste("<br>", "Count: ", count))
                )+
  geom_bar(stat = "identity",
           color = "black",
           fill = "lightblue") +
  coord_flip() 

ggplotly(plot1,
         tooltip = c('text'), height = 600, width = 800) 

Location visits by the time period

#location visit by the time period 
df3 <- df1 %>%
  group_by(location, period) %>%
  summarise(n = n())

plot2 <- ggplot(df3,
                aes(x = n,
                    y = location)) +
  geom_bar(stat = "identity",
           color = "black") +
  geom_col(aes(fill = period)) + 
  scale_fill_brewer(palette = "Spectral") +
  theme_light()

ggplotly(plot2, height = 600, width = 800)

For a more detailed breakdown, the individual graphs of each coffee cafe is ploted. It is observed that

#credit card by hour and location 
#library(devtools)
#("thomasp85/patchwork")
#library(patchwork)
list <- c("Kronos Mart", "Bean There Done That", "Brewed Awakenings","Jack's Magical Beans")

df2 <- df1 %>%
  group_by(location, hour) %>%
  summarise(n = n())

p1 <-
       ggplot(df2[df2$location =="Brewed Awakenings",],
       aes(x = hour, y = n)) +
       geom_bar(stat = "identity",
                  color = "black",
                  fill = "lightblue") + 
    ggtitle("Brewed Awakenings")

  p2 <-
       ggplot(df2[df2$location =="Bean There Done That",],
       aes(x = hour, y = n)) +
       geom_bar(stat = "identity",
                  color = "black",
                  fill = "lightblue") + 
    ggtitle("Bean There Done That")

  p3 <-
       ggplot(df2[df2$location =="Jack's Magical Beans",],
       aes(x = hour, y = n)) +
       geom_bar(stat = "identity",
                  color = "black",
                  fill = "lightblue") + 
    ggtitle("Jack's Magical Beans")
  
  
  p4 <-
       ggplot(df2[df2$location =="Brew've Been Served",],
       aes(x = hour, y = n)) +
       geom_bar(stat = "identity",
                  color = "black",
                  fill = "lightblue") + 
    ggtitle("Brew've Been Served")
  
  (p1 + p2)/(p3+p4)

transaction boxplot by location

Some outliers are revealed by the

ggplotly(
  ggplot(df1[df1$Category == 'F&B',], aes(x = location,
                y = price,
                #group = Category,
                color = location,
                text = paste("Price: ", price, "<br>", "CC: ", last4ccnum, "<br>", "Timestamp: ", timestamp))) + 
    #facet_grid(rows = vars(Category)) + 
    geom_jitter(size = 0.5) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)),
  tooltip = c('text')
)
ggplotly(
  ggplot(df1[df1$Category == 'Life',], aes(x = location,
                y = price,
                #group = Category,
                color = location,
                text = paste("Price: ", price, "<br>", "CC: ", last4ccnum, "<br>", "Timestamp: ", timestamp))) + 
    #facet_grid(rows = vars(Category)) + 
    geom_jitter(size = 0.5) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)),
  tooltip = c('text')
)
ggplotly(
  ggplot(df1[df1$Category == 'Work',], aes(x = location,
                y = price,
                #group = Category,
                color = location,
                text = paste("Price: ", price, "<br>", "CC: ", last4ccnum, "<br>", "Timestamp: ", timestamp))) + 
    #facet_grid(rows = vars(Category)) + 
    geom_jitter(size = 0.5) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)),
  tooltip = c('text')
)
#location visit by the time period 
df3 <- df1 %>%
  group_by(location, period) %>%
  summarise(n = n())

plot2 <- ggplot(df3,
                aes(x = n,
                    y = location)) +
  geom_bar(stat = "identity",
           color = "black") +
  geom_col(aes(fill = period)) + 
  scale_fill_brewer(palette = "Spectral") +
  theme_light()

ggplotly(plot2, height = 600, width = 800)

For a more detailed breakdown, the individual graphs of each coffee cafe is ploted. It is observed that

#credit card by hour and location 
#library(devtools)
#("thomasp85/patchwork")
#library(patchwork)
list <- c("Kronos Mart", "Bean There Done That", "Brewed Awakenings","Jack's Magical Beans")

df2 <- df1 %>%
  group_by(location, hour) %>%
  summarise(n = n())

p1 <-
       ggplot(df2[df2$location =="Brewed Awakenings",],
       aes(x = hour, y = n)) +
       geom_bar(stat = "identity",
                  color = "black",
                  fill = "lightblue") + 
    ggtitle("Brewed Awakenings")

  p2 <-
       ggplot(df2[df2$location =="Bean There Done That",],
       aes(x = hour, y = n)) +
       geom_bar(stat = "identity",
                  color = "black",
                  fill = "lightblue") + 
    ggtitle("Bean There Done That")

  p3 <-
       ggplot(df2[df2$location =="Jack's Magical Beans",],
       aes(x = hour, y = n)) +
       geom_bar(stat = "identity",
                  color = "black",
                  fill = "lightblue") + 
    ggtitle("Jack's Magical Beans")
  
  
  p4 <-
       ggplot(df2[df2$location =="Brew've Been Served",],
       aes(x = hour, y = n)) +
       geom_bar(stat = "identity",
                  color = "black",
                  fill = "lightblue") + 
    ggtitle("Brew've Been Served")
  
  #(p1 + p2)/(p3+p4)

transaction boxplot by location

ggplotly(
  ggplot(df1[df1$Category == 'F&B',], aes(x = location,
                y = price,
                #group = Category,
                color = location,
                text = paste("Price: ", price, "<br>", "CC: ", last4ccnum, "<br>", "Timestamp: ", timestamp))) + 
    #facet_grid(rows = vars(Category)) + 
    geom_jitter(size = 0.5) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)),
  tooltip = c('text')
)
ggplotly(
  ggplot(df1[df1$Category == 'Life',], aes(x = location,
                y = price,
                #group = Category,
                color = location,
                text = paste("Price: ", price, "<br>", "CC: ", last4ccnum, "<br>", "Timestamp: ", timestamp))) + 
    #facet_grid(rows = vars(Category)) + 
    geom_jitter(size = 0.5) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)),
  tooltip = c('text')
)
ggplotly(
  ggplot(df1[df1$Category == 'Work',], aes(x = location,
                y = price,
                #group = Category,
                color = location,
                text = paste("Price: ", price, "<br>", "CC: ", last4ccnum, "<br>", "Timestamp: ", timestamp))) + 
    #facet_grid(rows = vars(Category)) + 
    geom_jitter(size = 0.5) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)),
  tooltip = c('text')
)

Question 2

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.

Mapped using the same date and same amount of price. However, some of the credit card and loyalty data could not be mapped.

This suggests that there might be some time delay of the transactions and reflections of the details.

Furthermore, by interacting with the graph below, we can observe some loyalty card has transactions of multiple credit card.

#consistency between the loyaltycard data and cc data
ggplotly(
  ggplot(df1 %>% filter(loyaltynum != 'NA'),
       aes(x = datehour, y = last4ccnum, fill = loyaltynum)) +
         geom_tile(stat="identity", width=1, height=.9) 
)

For example, there are cases that one loyalty card could be matched to multiple credit card transactions. By clicking on the loyalty card number L8566, more than one credit card number showing same date and same amount matched.

By introducing the geo-data into the analysis. We can prove that there is some time lag of the transactions.

For example,we can look at the Kronos Mart transactions.

By overlaying the map and GPS data together and observe the transaction details, we can click on to the dot of the trajactories to look at the time of visit.

it is evident that there is a 12-hour time lag of the visit and reflections of transaction details.

tm_shape(ap) +
  tm_rgb(ap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255)

#shapefile consists of many files 
#Import vector GIS Data File 
abila_st <- st_read(dsn = 'MC2\\Geospatial',
                    layer = "Abila")
## Reading layer `Abila' from data source 
##   `C:\Study\MITB Term 3\ISSS608 Visual Analytics\cryshelleyx\R_Visual\content\post\2020-12-01-r-rmarkdown\MC2\Geospatial' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3290 features and 9 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
## Geodetic CRS:  WGS 84
gps$Timestamp <- date_time_parse(gps$Timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M:%S")
gps$id = as_factor(gps$id)
gps$day <- as.factor(get_day(gps$Timestamp))
gps$hour <- strftime(gps$Timestamp, format = "%H")
gps$period <- cut(as.numeric(gps$hour),
                  breaks = c(0,6,12,14,19,20,23),
                  labels = c("midnight",
                             "morning",
                             "lunch",
                             "afternoon",
                             "dinner",
                             "nignt"))

gps_sf <- st_as_sf(gps,
                   coords = c("long", "lat"),
                   crs = 4326)

gps_path_selected <- gps_sf %>%
  filter(id == c('1', "5", "23")) 
tmap_mode("view")
tm_shape(ap) +
  tm_rgb(ap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) + 
  tm_shape(gps_path_selected) +
  tm_dots(col = 'id')
kronos <- df1[c("timestamp", "location", "price")] %>%
  filter(location == "Kronos Mart")
  
mutate(kronos, timestamp = format(timestamp,"%d/%m/%Y, %H:%M:%S")) %>%
         datatable(rownames=TRUE, filter="top", class = 'cell-border stripe')

Question 3

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.

My methodology is to look at both the spending data and the gps location and try to look for patterns to match the datasets.

Patterns could be employees stay for a period of time and use their credit card at the location. By looking at evidences overtime and use logic of elimination, it is possible to deduct the location area and identify those not marked by the tourist map.

First of all, by observing a cholorpleth plot, we can find similar patterns for some vehicles.

Highlighted credit cards only have spending at the work related places, such as Abila Scrapyard. Hence we know that the owners of the credit card only use the vehicles for work purposes.

By looking at the map trajectories of the truck drivers, such as card ID 104, we can match that the corresponding credit card number is 8642.

Some of the problems identified could be:

For example, we could locate the Hippokampos through the card details and locations.